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Abstract 

Background: With an abundant amount of microarray gene expression data sets available through public 
repositories, new possibilities lie in combining multiple existing data sets. In this new context, analysis itself is no 
longer the problem, but retrieving and consistently integrating all this data before delivering it to the wide variety of 
existing analysis tools becomes the new bottleneck. 

Results: We present the newly released inSilicoMerging R/Bioconductor package which, together with the 
earlier released inSilicoDb R/Bioconductor package, allows consistent retrieval, integration and analysis of 
publicly available microarray gene expression data sets. Inside the inSilicoMerging package a set of five visual 
and six quantitative validation measures are available as well. 

Conclusions: By providing (i) access to uniformly curated and preprocessed data, (ii) a collection of techniques to 
remove the batch effects between data sets from different sources, and (iii) several validation tools enabling the 
inspection of the integration process, these packages enable researchers to fully explore the potential of combining 
gene expression data for downstream analysis. The power of using both packages is demonstrated by 
programmatically retrieving and integrating gene expression studies from the InSilico DB repository [https:// 
insilicodb.org/app/]. 

Keywords: Batch effect removal, Data integration, Gene expression, Microarray repositories, InSilico DB, 
Reproducibility 



Background 

An increasing amount of gene expression data sets 
is becoming available through public repositories (e.g., 
NCBI GEO [1,2], ArrayExpress [3]). All this new infor- 
mation offers potential clues for treatment of different 
diseases but one of the current challenges in the field is to 
unlock the potential of this data. Combining a large num- 
ber of gene expression data sets originating from different 
labs could be beneficial for the discovery of new biological 
insights and could increase the statistical power of gene 
expression analysis, but then this data should be combined 
in a consistent manner. 
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The first hurdle for large-scale gene expression anal- 
ysis is obtaining access to consistently annotated and 
preprocessed data [4]. Inside the inSilico Project 1 the 
inSilicoDb R/Bioconductor package was released [5], 
offering programmatical access to all expertly-curated 
and uniformly preprocessed expression profiles from the 
InSilico DB repository [6]. 

However, accessing uniformly represented data is only 
the first step when combining and integrating gene 
expression data sets since the use of different experimen- 
tation plans, platforms and methodologies by different 
research groups introduces undesired batch effects in the 
gene expression measurements [7] thus severely hinder- 
ing downstream analysis. The problems raised by the 
batch specific unwanted variation as well as the potential 
sources leading to batch effects have already been revealed 
and widely discussed in a number of publications 1 [8-14]. 
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It is clear that it is necessary to incorporate the adjust- 
ment for batch effects as a standard step in the analysis of 
high- throughput data [15] and several methods to remove 
this specific type of bias - while preserving the biologi- 
cal variance - have been proposed these last years [16-19]. 
Unfortunately, there is no single best method yet that can 
be used in all situations. 

We present here the inSilicoMerging R/Bio- 
conductor package, which combines several of the most 
used methods to remove the unwanted batch effects in 
order to combine, or merge different data sets in an intu- 
itive and user-friendly manner. 

Evaluating and validating the results of batch effect 
removal methods is perhaps as important as the batch 
effect removal process itself. Without good and reli- 
able evaluation tools, these methods could result in an 
even increased distortion of the data, introducing serious 
errors in the results of any downstream analysis per- 
formed. Therefore, five simple but powerful visual inspec- 
tion tools and six quantitative measures to evaluate the 
different batch-effect removal methods, are provided as 
well. 

Although this package can be used as a stand-alone 
tool, its power lies in combining it with tools like the 
inSilicoDb package, thereby paving the way towards 
large-scale meta-analysis of gene expression repositories. 

Implementation 

The inSilicoMerging package is part of the open- 
source Bioconductor [20] project since release 2.10. 

Data formats 

A conscious choice has been made to work with the 
internal data formats of Bioconductor for handling 
gene expression data objects, namely ExpressionSets. 
Each function can take as input a list of ExpressionSet 
objects and the returned value of the different merging 
methods is a single ExpressionSet, containing the merged 
data. 

Link with InSilico DB 

There is a programmatic link with the InSilico DB 
genomics data hub [6] through the R/Bioconductor 
inSilicoDb package [5], providing uniformly curated 
and preprocessed gene expression data. All genomic 
data sets are formatted as fully annotated Expres- 
sionSets which can be used immediately within the 
inSilicoMerging package. Examples in the next 
section will illustrate this seamless integration. 

On the browse page of InSilico DB [https://insilicodb. 
org/app/browse] it is also possible to download gene 
expression data sets in the same format. After loading 
these data sets in R, they can similarly be used within 
inSilicoMerging. 



Implementation of merging algorithms 

A set of six batch effect removal methods for merging gene 
expression data is currently provided: BMC (Batch Mean- 
Centering [21]), COMBAT (Empirical Bayes [18]), DWD 
(Distance-Weighted Discrimination [17]), GENENORM 
(Z-score standardization), NONE and XPN (Cross- 
Platform Normalization [19]). For a detailed description 
of the major algorithms we refer to the original publica- 
tions and in addition each of those six methods is also 
described in the accompanying package vignette. For 
more information about merging through batch effect 
removal in general please consult [22]. 

The implementation of COMBAT was based on the 
orginal R code made available by the authors on their 
website [http://www.bu.edu/jlab/wp-assets/ComBat/Down 
load_files/ComBat.R]. The implementation of DWD uses 
the kDWD function from the corresponding package [23] . 
XPN has been implemented based on Matlab code pro- 
vided by the authors [https://genome.unc.edu/xpn/]. The 
remaining merging algorithms were implemented using 
basic R functions. All implementations were tested by 
empirical validation and when possible, by comparing 
results with those from the original implementations. 

Note that gene expression data sets from different plat- 
forms can be merged by keeping only the common fea- 
tures (genes or probe sets) in the merged data set. Some 
merging techniques are only reported and implemented to 
merge exactly two studies (e.g., XPN and DWD). To merge 
multiple studies, an additional step was added in which all 
studies were combined two-by-two in a recursive way. 

Implementation of validation algorithms 

For the visual inspection of merging results, five quali- 
tative validation methods are provided. In addition, six 
quantitative validation methods are provided as well. 
These quantitative indices provide a more accurate eval- 
uation of the batch effect removal and they are very 
effective tools for comparing the results of different meth- 
ods. Below a brief description of each validation method 
is provided, for an extensive survey we again invite the 
reader to consult [22]. 

Qualitative validation methods 

Five qualitative validation methods are implemented: 

plotMDS: creates a double-labeled Multidimensional 
Scaling (MDS) plot. 

plot RLE: creates a relative log expression (RLE) plot, 
initially proposed to measure the overall quality of a data 
set [24] but also useful in this context. 

plotGeneWiseBoxPlots: provides a local visualiza- 
tion by looking at the box plots of a specific gene across all 
samples. 
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plotDendrogram: creates a sample-wise hierarchical 
clustering plot. 

plotGeneWiseDensity: is another local visualization 
tool which plots the estimated density of genes across 
different data sets. 

If available we extended the relevant existing functions 
from R and Bioconductor. 

Quantitative validation methods 

Six quantitative validation indices were implemented, 
based on the description in [19]: 

measureAsymmetry: This measure compares the sam- 
ple asymmetry by calculating the skewness before and 
after batch effect removal. Good merging methods should 
not alter the samples' skewness. 

measureSamplesMeanCorrCoef : This measure 
compares the correlation between the same samples 
before and after batch effect removal. Good merg- 
ing methods should maintain similarity and therefore 
correlation should be high. 

measureGenesMeanCorrCoef : This measure com- 
pares the correlation between the same genes before and 
after batch effect removal. Good merging methods should 
maintain similarity and therefore correlation should be 
high. 

measureSignif icantGenesOverlap: This mea- 
sure compares the number of known control genes found 
in the top ranked genes according to a particular method 
for differential expressed genes discovery in the adjusted 
study with those found in the original studies. 

measureSamplesOverlap: This measure is an indi- 
cation of how well the samples from each study are mixed 
after batch effect removal. For each array in each study the 
distance to the nearest array in all other studies is calcu- 
lated. The average of these distances is used as a measure. 
The lower the average distance, the better the overlap, and 
the better the quality of the batch effect removal. 

measureGenesOverlap: We also implemented a 
novel quantitative validation index which calculates the 
average difference in the distribution of all genes in the 
individual studies as follows: 



GOVi = 



El**, 



(i) 



gi in the first respectively second data set and they are 
empirically estimated using the Parzen-Rosenblatt density 
estimation method [25]. Note that this index is bounded 
in [0 1], the minimum value being obtained when the 
two distributions are identical while the maximum value 
is reached when the two distributions are completely sep- 
arated. The global cross-studies genes' overlapping index 
is given by: 



GOV = 



m 



(2) 



where P x . and P y . are the normalized probability density 
functions (such that • P x . = 1 and • P y . = 1) of gene 



where m is the number of common genes between the two 
data sets. Note that GOV is still bounded in [0 1], pro- 
viding a clear quantification of the batch effect or of the 
quality of the data integration process. 

Results and discussion 

In the following sections we demonstrate the ease and 
power of the inSilicoMerging package based on an 
example analysis of two lung cancer data sets assayed 
on different platforms. We will discuss a complete work- 
flow consisting of search and retrieval of data, followed by 
merging and corresponding validation. 

Accessing Data 

We start our example analysis by using the inSilicoDb 
package to search for data sets containing lung cancer 
samples, assayed on two different platforms: Affymetrix 
Human Genome U133A Array (GPL96) and Affymetrix 
Human Genome U133 Plus 2.0 Array (GPL570). We 
restrict our search to data sets with the number of samples 
between 100 and 150 (the select function). All data sets 
retrieved from the InSilico DB database are consistently 
preprocessed using frozen RMA [26], as explained in [5]. 

> library (" inSilicoDb" ) ; 

> lung96 = getDatasetList ( "GPL96 " , 
"FRMA", query="lung cancer"); 

> lung570 = getDatasetList ( "GPL570 " , 
"FRMA", query="lung cancer") ; 

> select = function (x, platform) { 
+ n = nrow (pData (getAnnotations (x, 
platform) ) ) ; 

+ if(n > 100 & n < 150) { return(x); 
} 

+ }; 

> lung96 = unlist ( sapply ( lung96 , 
select, platform="GPL96") ) ; 

" GSE2 6 03" " GSE1 0 072" " GSE3 218" 

> lung570 = unlist ( sapply ( lung570 , 
select, platform="GPL570") ) ; 
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"GSE83 3 2" "GSE1966 7" "GSE198 04" 
"GSE194 0 7" "GSE2 02 57" 

For both platforms we select one study from the above 
lists and retrieve the fully annotated and preprocessed 
data set: 

> eset570 = getDataset ( "GSE19804 " , 
"GPL570" , " FRMA" , genes=TRUE) ; 

> eset96 = getDataset ( "GSE10072 " , 
"GPL96 " , "FRMA", genes=TRUE) ; 

> table (pData (eset570) [, "Disease"] ) ; 

control lung cancer 

49 58 

> table (pData (eset96) [, "Disease "]) ; 

control lung cancer 

60 60 

Both studies contain normal and tumor samples with a 
common Disease annotation label and were selected for 
this example because they are almost equally divided in 
both disease subdivisions as can be seen in the previous 
code example. 

Batch Effect Removal and Merging 

The different merging methods implemented in the 
inSilicoMerging package can now all be applied on 
the retrieved ExpressionSets by calling a single function 
merge. For example, simple combination without batch 
effect removal can be done as follows: 

> merge (esets, method= "NONE " ) ; 

with esets a list of ExpressionSet objects. 
The method parameter can be one of the options 
described above: BMC, COMBAT, DWD, GENENORM, NONE 
or XPN. 

Next to the merged set where no transformation (NONE) 
is applied and we also build a merged set using the 
COMBAT method, hence creating two different Expres- 
sionSet objects containing merged data set: 

> library ( inSilicoMerging) ; 

> esets = list (eset570 , eset96); 

> esetJNONE = merge (esets, 
method="NONE" ) ; 

> eset_COMBAT = merge (esets, 
method=" COMBAT" ) ; 

> table (pData (esetJNONE) [, "Disease" ] ) 

control lung cancer 
109 118 

Validation 

Five visual validation methods are provided by the pack- 
age in order to immediately enable a first inspection of 



the merged data sets. For this example, a first global 
visual inspection of the two merging approaches applied 
is demonstrated via the plotMDS function. 



> par (mf row=c ( 1 , 2 ) ) ; 

> plotMDS (esetJNONE, 

+ targetAnnot = "Disease", 

+ batchAnnot = "Study", 

+ main = "NONE (No 

Transformation) ") ; 

> plotMDS (eset_COMBAT, 

+ targetAnnot = "Disease", 

+ batchAnnot = "Study", 

+ main = "COMBAT" ) ; 



In Figure 1 the two MDS plots are shown. Samples are 
labeled by color, based on the target biological variable 
of interest (normal versus lung cancer) and are labeled 
by symbol, based on the study they originate from. This 
information was retrieved from the curated phenotype 
information of the data sets using the targetAnnot and 
batchAnnot arguments, respectively. No extra manual 
step was required. 

We can observe the effect of using COMBAT on the 
right part of Figure 1, where samples are more clustered 
together based on the target biological variable of interest 
(color), instead of the study they originate from (symbol). 
On the left part, if no merging technique is applied, a huge 
inter-study bias can be observed, hindering any further 
analysis of this combined data set. 

Similarly, on a local scale we demonstrate the use 
of the plotGeneWiseDensity function for a specific 
gene. 

> par (mf row=c ( 1 , 2 ) ) ; 

> plotGeneWiseDensity (eset JNFONE , 

+ batchAnnot 
= "Study", 

+ gene = 

"MYL4 " , 

+ main = 

"NONE (No Transformation)"); 

> plotGeneWiseDensity (eset_COMBAT, 

+ batchAnnot 
= "Study", 

+ gene = 

"MYL4 " , 

+ main = 

"COMBAT") ; 

Gene MYL4 was arbitrary selected and its estimated 
density across both studies is shown in Figure 2. We can 
observe that the COMBAT method makes the distributions 
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NONE (No Transformation) 



COMBAT 




Figure 1 MDS plots. Visual inspection of two merged data sets using double-labeled Multi Dimensional Scaling (MDS) plots. In these MDS plots 
samples are labeled by color based on the target biological variable of interest and are labeled by symbol based on the study they originate from. 
On the left the two data sets are merged without any transformation and on the right the two data sets are merged by using the COMBAT method. 
It is intuitively clear from the MDS plots that samples cluster by study without any transformation and by disease after performing COMBAT. 



of this specific gene across the two studies more similar 
to each other. Note that the magnitude of the batch effect 
at the gene level is very dependent on the specific gene 
selected. 

R code illustrating the five visual inspection meth- 
ods is provided in the Additional file 1. The results 
for the three remaining methods can be found in 
Figures 3, 4, 5. 



Comparison with existing tools 

We compare inSilicoMerging, together with 
inSilicoDb, on three axes: data access, data merging 
and validation algorithms. 

Data access 

GEO Query [27] is an existing package to retrieve gene 
expression data sets in R/Bioconductor. However, the 



NONE (No Transformation) 



COMBAT 




Figure 2 Genewise density plots. Visual inspection of two merged data sets using gene-wise density plots. For the randomly selected MYL4 gene, 
density plots in each study are shown, colored by study. On the left the two data sets are merged without any transformation and on the right the 
two data sets are merged by using the COMBAT method. The genewise density plots show that after transformation the distribution is much more 
similar. 
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NONE (No Transformation) 



COMBAT 




Figure 3 RLE plots. Visual inspection of two merged data sets using relative log expression plots. In these relative log expression plots samples are 
colored by study. For clarity purposes only 40 randomly selected samples are shown. On the left the two data sets are merged without any 
transformation and on the right the two data sets are merged by using the COMBAT method. After applying COMBAT the mean of the RLE is 
approximately 0 for all genes which indicates a good batch effect removal. 



information about the samples is in a raw form requir- 
ing a manual curation step in transit between a data 
repository (e.g., GEO) and a data analysis platform 
(e.g., R/Bioconductor). A lack of standardized phenotypic 
meta-data and inconsistent preprocessing can hinder the 
integration and combination of different raw form data 
sets. 



Data merging 

Several of the batch effect removal techniques included 
in the inSilicoMerging package have been imple- 
mented prior to this package in different programming 
and/or script languages (e.g., R, Matlab, Java). A major 
advantage of inSilicoMerging is the consistent inter- 
face for all merging methods and its integration in the 
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Figure 4 Genewise box plots. Visual inspection of two merged data sets using a gene-wise box plots. Boxplots of the randomly selected MYL4 
gene are grouped by study and colored by the target biological variable of interest. On the left the two data sets are merged without any 
transformation and on the right the two data sets are merged by using the COMBAT method. After batch effect removal the distribution of the gene 
is much more similar between studies than without. 
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Figure 5 Dendrogram plots. Visual inspection of two merged data sets using dendrograms plots. In these dendrogram plots samples are labeled 
by a number corresponding to the study they originate from. For clarity purposes only 40 randomly selected samples are used to perform the 
hierarchical clustering. On the left the two data sets are merged without any transformation and on the right the two data sets are merged by using 
the COMBAT method. In the right plot it can be seen that samples originating from different studies are mixed, while on the left they are grouped 
per study. 



R/Bioconductor framework. We will briefly discuss the 
added value of our package compared to those already 
available in R. 

The authors of the COMBAT method provide an imple- 
mentation on their website which allows the input of 
known covariates (besides batch id). They work with sev- 
eral matrices containing the numerical data, and relevant 
annotations and covariates. Within our package, all this 
information is bundled and encoded in the ExpressionSet 
structure. 

The authors of the DWD method recently have released 
the DWD R/Bioconductor package [23]. This package is 
intended as a general use of the DWD technique and is not 
specific for the goal of batch effect removal. It is therefore 
not straightforward to use as a merging technique for gene 
expression data sets and once again the relevant data is 
dispersed over several objects. In addition, the necessary 
transformation of the gene expression values was added 
to the inSilicoMerging package in order to obtain a 
merged data set as result. 

The CONOR package which is available through the 
CRAN repository [28], most closely resembles our soft- 
ware as it includes multiple batch effect removal meth- 
ods and several methods based on discretizing the gene 
expression data. However, it lacks the user friendliness 
provided by our package and the direct integration with 
the Bioconductor framework. 

Validation 

Evaluating and validating the results of batch effect 
removal methods is perhaps as important and difficult 



as the batch effect removal process itself. Without good 
and reliable evaluation tools, these methods could result 
in an increased distortion of the data, introducing seri- 
ous errors in the results of any downstream analysis 
performed. 

To the best of our knowledge no explicit software has 
been developed for validating the quality of the batch 
effect removal process of microarray gene expression data 
sets. The collection of both qualitative and quantitative 
validation methods is therefore a unique property of the 
inSilicoMerging package. 

Conclusions 

The inSilicoMerging R/Bioconductor package 
enables the integration of batch adjustment in any work- 
flow dealing with large-scale analysis of gene expression 
data, paving the road towards unlocking the potential 
of the massive amount of publicly available microar- 
ray data. Moreover, its seamless integration with the 
R/Bioconductor framework for further analysis and 
the inSillicoDb package for retrieval of consistent 
data, minimizes the need for manual intervention. As 
a result transparency, trackability and reproducibility 
of large-scale gene expression analysis is significantly 
increased. 

Availability and requirements 

Project name: inSilicoMerging 

Project home page: http://www.bioconductor.org/pack 

ages/release/bioc/html/inSilicoMerging.html 

Operating system(s): Platform independent 
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Programming language: R 

Other requirements: Bioconductor 

License: GPL-2 

Any restrictions to use by non-academics: none 
Endnote 

1 www.insilico.ulb.ac.be 
Additional file 



Additional file 1 : R script. code_example . R R script using both 
inSilicoDb and inSilicoMerging packages generating Figure 1-5. 



Competing interests 

The authors declare that they have no competing interests. 
Author's contributions 

JT, SM and CL designed and implemented the inSilicoMerging package and 
drafted the manuscript. JT, DS and RD designed and implemented the 
inSilicoDb package. AC, CM and DWS were involved in the development of 
the back-end software. VdS provided the expert-cu rated annotation data of 
the underlying database tool. HB and AN conceived the study and 
participated in its design and coordination. All authors read and approved the 
final manuscript. 

Acknowledgements 

This work was partially funded by the Brussels Institute for Research and 
Innovation (INNOVIRIS). 

Author details 

1 Al (CoMo), Vrije Universiteit Brussel, Pleinlaan 2, 1050 Brussels, Belgium. 
2 IRIDIA, Universite Libre de Bruxelles, Avenue F. D. Roosevelt 50, 1 050 Brussels, 
Belgium. 

Received: 30 May 201 2 Accepted: 1 8 December 201 2 
Published: 24 December 201 2 

References 

1 . Edgar R, Domrachev M, Lash AE: Gene Expression Omnibus: NCBI gene 
expression and hybridization array data repository. Nucleic Acids Res 

2002, 30:207-10. 

2. Barrett T, Troup DB, Wilhite SE, Ledoux P, Evangelista C, Kim IF, 
Tomashevsky M, Marshall KA, Phillippy KH, Sherman PM, et al.: NCBI GEO: 
archive for functional genomics data sets-10 years on. Nucleic Acids 
Res 201 1, 39(Database issue):D1 005-10. 

3. Parkinson H, Sarkans U, Kolesnikov N, Abeygunawardena N, BurdettT, 
Dylag M, Emam I, Fame A, Hastings E, Holloway E, et al.: ArrayExpress 
update-an archive of microarray and high-throughput 
sequencing-based functional genomics experiments. Nucleic Acids 
Res 201 1 , 39(Database issue):D1 002-4. 

4. Larsson O, Sandberg R: Lack of correct data format and comparability 
limits future integrative microarray research. Nat Biotechnol 2006, 
24(1 1):1 322-3. 

5. Taminau J, Steenhoff D, Coletta A, Meganck S, Lazar C, de Schaetzen V, 
Duque R, Molter C, Bersini H, Nowe A, Weiss Solis, DY: inSilicoDb: an 
R/Bioconductor package for accessing human Affymetrix 
expert-curated datasets from GEO. Bioinformatics 201 1 , 27(22):3204-5. 

6. Coletta A, Molter C, Duque R, Steenhoff D, Taminau J, de Schaetzen V, 
Meganck S, Lazar C, Venet D, Detours V, Nowe A, Bersini H, Solis DYW: 
InSilico DB genomic datasets hub: an efficient starting point for 
analyzing genome-wide studies in GenePattern, Integrative 
Genomics Viewer, and R/Bioconductor. Genome Biol 201 2, 1 3(1 1 ):R1 04. 

7. Scherer A(Ed): Batch Effects and Noise in Microarray Experiments: Sources 
and Solutions. Chichester, UK: John Wiley and Sons; 2009. 

8. Chu TM, Deng S, Wolfinger R, et al.: Cross-site comparison of gene 
expression data reveals high similarity. Environ Health Perspect 2004, 
112(4):449-455. 



9. Dobbin KK, Beer DG, Meyerson M, et al.: Interlaboratory Comparability 
Study of Cancer Gene Expression Analysis Using Oligonucleotide 
Microarrays. Clinical Cancer Research 2005, 1 1 (2):565-572. 

1 0. Zakharkin S, Kim K, Mehta T, et al.: Sources of variation in Affymetrix 
microarray experiments. BMC Bioinformatics 2005, 6:214. 

1 1. Han ES, Wu Y, McCarter R, et al.: Reproducibility, Sources of Variability, 
Pooling, and Sample Size: Important Considerations for the Design 
of High-Density Oligonucleotide Array Experiments. The Journals of 
Gerontology Series A: Biological Sciences and Medical Sciences 2004, 
59(4):B306-B315. 

1 2. Breit S, Nees M, Schaefer U, et al.: Impact of pre-analytical handling on 
bone marrow mRNA gene expression. British Journal of Haematology 

2004,126(2). 

13. Bakay M, Chen YW, Borup R, et al.: Sources of variability and effect of 
experimental approach on expression profiling data interpretation. 

BMC Bioinformatics 2002, 3:4. 

1 4. Brown JS, Kuhn D, Wisser R, et al.: Quantification of sources of variation 
and accuracy of sequence discrimination in a replicated microarray 
experiment. Biotechniques 2004, 36:324-332. 

1 5. Leek JT, Scharpf RB, Bravo HC, Simcha D, Langmead B, Johnson WE, 
Geman D, Baggerly K, Irizarry RA: Tackling the widespread and critical 
impact of batch effects in high-throughput data. Nat Rev Genet 201 0, 
11(10):733-9. 

1 6. Sims AH, Smethurst GJ, Hey Y, Okoniewski MJ, Pepper SD, Howell A, Miller 
CJ, Clarke RB: The removal of multiplicative, systematic bias allows 
integration of breast cancer gene expression datasets - improving 
meta-analysis and prediction of prognosis. BMC medical genomics 

2008,1:42. 

1 7. Benito M, Parker J, Du Q, Wu J, Xiang D, Perou CM, Marron JS: Adjustment 
of systematic microarray data biases. Bioinformatics 2004, 20:1 05-1 4. 

1 8. Johnson WE, Li C, Rabinovic A: Adjusting batch effects in microarray 
expression data using empirical Bayes methods. Biostatistics 2007, 

8:118-27. 

1 9. Shabalin AA, Tjelmeland H, Fan C, Perou CM, Nobel AB: Merging two 
gene-expression studies via cross-platform normalization. 
Bioinformatics 2008, 24(9): 1 1 54-60. 

20. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, 
Gautier L, Ge Y, Gentry J, et al.: Bioconductor: open software 
development for computational biology and bioinformatics. 
Genome Biol 2004, 5(1 0):R80. 

21 . Sims A, Smethurst G, Hey Y, Okoniewski M, Pepper S, Howell A, Miller C, 
Clarke R: The removal of multiplicative, systematic bias allows 
integration of breast cancer gene expression datasets - improving 
meta-analysis and prediction of prognosis. BMC Medical Genomics 
2008,1:42. 

22. Lazar C, Meganck S, Taminau J, Steenhoff D, Coletta A, Molter C, Weiss 
Solis DY, Duque R, Bersini H, Nowe A: Batch effect removal methods for 
microarray gene expression data integration: a survey. Briefings in 
Bioinf 2012. doi:10.1093/bib/bbs037. [http://bib.oxfordjournals.org/ 
content/early/201 2/07/31 /bib.bbs037.abstract] 

23. Huang H, Lu X, Liu Y, Haaland P, Marron JS: R/DWD: distance-weighted 
discrimination for classification, visualization and batch 
adjustment. Bioinformatics 201 2, 28(8): 1 1 82-3. 

24. Brettschneider J, Collin F, Bolstad BM, Speed TP: Quality Assessment for 
Short Oligonucleotide Microarray Data. Technometrics 2008, 
50(3):241-264. 

25. Parzen E: On Estimation of a Probability Density Function and Mode. 

The Ann Math Stat 1 962, 33(3): 1 065. 

26. McCall MN, Bolstad BM, Irizarry RA: Frozen robust multiarray analysis 
(fRMA). Biostatistics 201 0, 1 1 (2):242-53. 

27. Sean D, Meltzer PS: GEOquery: a bridge between the Gene Expression 
Omnibus (GEO) and BioConductor. Bioinformatics 2007, 23(1 4):1 846-7. 

28. Rudy J, Valafar F: Empirical comparison of cross-platform 
normalization methods for gene expression data. BMC Bioinformatics 
2011,12:467. 



doi:1 0.1 1 86/1 47 1 -21 05-1 3-335 

Cite this article as: Taminau et al.: Unlocking the potential of publicly avail- 
able microarray data using inSilicoDb and inSilicoMerging R/Bioconductor 
packages. BMC Bioinformatics 201 2 13:335. 
V J 



